
Remarkable Openness from Google’s Black Box Thanks to Saul Hansell

I’m more than a little skeptical of mainstream media articles about the search engines. With so many terrible experiences – inaccuracy, bias, shallow information, agenda-based reporting – it’s easy to see why. However, today I’m thrilled to see an article from Saul Hansell in the NY Times that’s not only impeccably well-written, but informative even to those of us most deeply inside the search industry. The article – Google Keeps Tweaking Its Search Engine – is quite possibly the best mainstream media article about Google, or modern search technology, in the last 5 years.

There are several big takeaways for search marketers, so let’s dive right in:

Mr. Singhal is the master of what Google calls its “ranking algorithm” — the formulas that decide which Web pages best answer each user’s question. It is a crucial part of Google’s inner sanctum, a department called “search quality” that the company treats like a state secret. Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine.

Google values Mr. Singhal and his team so highly for the most basic of competitive reasons. It believes that its ability to decrease the number of times it leaves searchers disappointed is crucial to fending off ever fiercer attacks from the likes of Yahoo and Microsoft and preserving the tidy advertising gold mine that search represents.

It’s nice to hear that Google feels much the same way I do about search quality – in particular, that the current competitive advantage is primarily about the relevance of results. We’re also getting a peek at a Googler we’ve never met before (at least, outside the ‘plex). I’m guessing that poor Mr. Singhal is now receiving quite a few emails to every possible variation of his name @ google.com (poor guy).

Any of Google’s 10,000 employees can use its “Buganizer” system to report a search problem, and about 100 times a day they do — listing Mr. Singhal as the person responsible to squash them.

“Someone brings a query that is broken to Amit, and he treasures it and cherishes it and tries to figure out how to fix the algorithm,” says Matt Cutts, one of Mr. Singhal’s officemates and the head of Google’s efforts to fight Web spam, the term for advertising-filled pages that somehow keep maneuvering to the top of search listings.

Some complaints involve simple flaws that need to be fixed right away. Recently, a search for “French Revolution” returned too many sites about the recent French presidential election campaign — in which candidates opined on various policy revolutions — rather than the ouster of King Louis XVI. A search-engine tweak gave more weight to pages with phrases like “French Revolution” rather than pages that simply had both words.

The Google bug system reminds us that behind all the magic, human beings toil to ensure quality, compare individual results and make tweaks based upon the best aggregate changes. The short paragraph about the French Revolution, if accurate, gives some insight into the fact that the algorithm is not uniform – not even close. Individual queries get individual attention – so next time you’re stumped because Google’s formula for some new term you’re optimizing doesn’t match up against your experiences from the past, you may simply be dealing with a different set of criteria.
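
To make the “French Revolution” fix concrete, here’s a toy sketch of phrase weighting – my own illustration, not Google’s actual formula. A page containing the exact phrase outscores a page that merely contains both words somewhere; the term weight and phrase bonus are invented for the example.

```python
# Toy phrase-weighting sketch (illustrative only, not Google's formula):
# pages containing the exact phrase score higher than pages that merely
# contain both words somewhere in the text.

def phrase_aware_score(query: str, doc_text: str,
                       term_weight: float = 1.0,
                       phrase_bonus: float = 2.0) -> float:
    """Score a document for a query, rewarding exact-phrase matches."""
    doc_lower = doc_text.lower()
    terms = query.lower().split()

    # Base score: one point per query term present anywhere in the doc.
    score = term_weight * sum(1 for t in terms if t in doc_lower)

    # Bonus: the terms appear together as a contiguous phrase.
    if query.lower() in doc_lower:
        score += phrase_bonus

    return score

revolution_page = "The French Revolution ended with the ouster of Louis XVI."
election_page = "The French election saw candidates promise a policy revolution."

print(phrase_aware_score("french revolution", revolution_page))  # 4.0
print(phrase_aware_score("french revolution", election_page))    # 2.0
```

The history page wins even though both pages contain both words – which is exactly the behavior the tweak described in the article was after.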

But Mr. Singhal often doesn’t rush to fix everything he hears about, because each change can affect the rankings of many sites. “You can’t just react on the first complaint,” he says. “You let things simmer.”

So he monitors complaints on his white board, prioritizing them if they keep coming back. For much of the second half of last year, one of the recurring items was “freshness.”

Freshness, which describes how many recently created or changed pages are included in a search result, is at the center of a constant debate in search: Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them.

But last year, Mr. Singhal started to worry that Google’s balance was off. When the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them.

Hmmmm… Google not showing fresh results, eh? Sounds mighty familiar, no? We at SEOmoz, and most of the rest of the informed SEO world, have been talking about this for the last few years; in particular, since March of 2004, when the infamous “sandbox” first reared its ugly head. It’s nice to get confirmation and feel the vindication of this transparency, but there’s also a lesson to be learned – Google isn’t perfect, and they often look inward. The note that this problem wasn’t addressed until the query “Google Finance” didn’t show Google Finance is strong evidence that Google is like many other companies: things don’t get fixed unless the folks internally feel the pain of the problem. Thus, next time you want to fight with the Google engineers about what you feel is inequitable treatment in the SERPs, the best way to do it might be to illustrate how the problem affects Google products.
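
For those who like the trade-off spelled out, here’s a back-of-the-envelope sketch of the freshness balance the article describes – entirely my own construction, with made-up weights and half-life: blend a link-based authority score with a recency score that decays as the page ages.

```python
# Toy model of the freshness trade-off (my construction, not Google's):
# blend link-based authority with a recency score that decays
# exponentially as the page gets older.

def blended_score(authority: float, age_days: float,
                  freshness_weight: float = 0.3,
                  half_life_days: float = 30.0) -> float:
    """freshness_weight and half_life_days are invented knobs standing
    in for whatever Google actually tunes."""
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after a month
    return (1 - freshness_weight) * authority + freshness_weight * recency

print(blended_score(authority=0.9, age_days=365))  # old, well-linked page
print(blended_score(authority=0.4, age_days=1))    # brand-new, few links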

Mr. Singhal introduced the freshness problem, explaining that simply changing formulas to display more new pages results in lower-quality searches much of the time. He then unveiled his team’s solution: a mathematical model that tries to determine when users want new information and when they don’t. (And yes, like all Google initiatives, it had a name: QDF, for “query deserves freshness.”)…

…“What do you take us for, slackers?” Mr. Singhal responded with a rebellious smile.

The QDF solution revolves around determining whether a topic is “hot.” If news sites or blog posts are actively writing about a topic, the model figures that it is one for which users are more likely to want current information. The model also examines Google’s own stream of billions of search queries, which Mr. Singhal believes is an even better monitor of global enthusiasm about a particular subject.

As an example, he points out what happens when cities suffer power failures. “When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds,” he says.

Mr. Singhal says he tested QDF for a simple application: deciding whether to include a few news headlines among regular results when people do searches for topics with high QDF scores. Although Google already has a different system for including headlines on some search pages, QDF offered more sophisticated results, putting the headlines at the top of the page for some queries, and putting them in the middle or at the bottom for others.

In the SEO world, we’re all familiar with the new onebox results that pop up with news results, and now we’ve got a bit of backstory on it. I also suspect that, although it wasn’t mentioned in the article, there may have been some tweaking of the organic listings to help support more freshness in the results themselves. Google’s still favoring a lot of old results, but of the thousand or so queries we monitor internally and for clients, there are at least some indications that a freshness boost exists.

Another big takeaway here is the thought process about how temporal data and query analysis happens at the ‘plex. The level of awareness of searcher satisfaction with results is certainly impressive, and so is the exceptionally fast timeline for fixes (at least, some fixes – in SEO, we’ve got our own examples of tortoise-speed implementation). What the article says, though, is that Google can determine, by examining blog posts and news articles, which topics and queries might be getting “hot” and return more “fresh” results for those queries. This fits in precisely with how smart SEOs advise on “escaping” from the sandbox – get lots of link love and lots of people talking about you, i.e. become newsworthy.
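
If you want to picture how a QDF-style trigger might work, here’s a rough sketch based on my reading of the article – the window size and spike threshold are invented, and this is certainly far simpler than whatever Google actually runs. The idea: flag a topic as “hot” when its current query volume spikes well above its own historical baseline.

```python
from collections import deque

# Rough sketch of a "query deserves freshness" trigger (my reading of
# the article, not Google's system): a topic is hot when its current
# query volume spikes well above its historical baseline.

class HotTopicDetector:
    def __init__(self, window: int = 24, spike_ratio: float = 3.0):
        self.history = deque(maxlen=window)  # e.g. hourly query counts
        self.spike_ratio = spike_ratio       # invented threshold

    def observe(self, query_count: int) -> bool:
        """Record the latest count; return True if the topic looks hot."""
        baseline = (sum(self.history) / len(self.history)) if self.history else 0.0
        self.history.append(query_count)
        return baseline > 0 and query_count >= self.spike_ratio * baseline

detector = HotTopicDetector()
for count in [100, 110, 95, 105, 100]:
    detector.observe(count)          # quiet baseline, ~100 queries/hour
print(detector.observe(2500))        # the blackout hits: True
```

Note how this matches the blackout anecdote: the query stream reacts in seconds, long before the first news article could push a signal into the index.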

As Google compiles its index, it calculates a number it calls PageRank for each page it finds…

…Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years…

…Increasingly, Google is using signals that come from its history of what individual users have searched for in the past, in order to offer results that reflect each person’s interests. For example, a search for “dolphins” will return different results for a user who is a Miami football fan than for a user who is a marine biologist. This works only for users who sign into one of Google’s services, like Gmail…

…Once Google corrals its myriad signals, it feeds them into formulas it calls classifiers that try to infer useful information about the type of search, in order to send the user to the most helpful pages. Classifiers can tell, for example, whether someone is searching for a product to buy, or for information about a place, a company or a person. Google recently developed a new classifier to identify names of people who aren’t famous. Another identifies brand names…

…These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. A page about President Bush’s speech about Darfur last week at the White House, for example, would rank high in topicality for “Darfur,” less so for “George Bush” and even less for “White House.” Google combines all these measures into a final relevancy score.

The sites with the 10 highest scores win the coveted spots on the first search page, unless a final check shows that there is not enough “diversity” in the results. “If you have a lot of different perspectives on one page, often that is more helpful than if the page is dominated by one perspective,” Mr. Cutts says. “If someone types a product, for example, maybe you want a blog review of it, a manufacturer’s page, a place to buy it or a comparison shopping site.”

Wow… OK – 200 signals of quality (we’ve covered a lot of the big ones here), a classification system that attempts to determine query intent and an automated system to determine diversity. That’s a lot of confirmation about what many have only theorized until now. I’m not going to go into detail about each of these – I invite you to do so in the comments – but I’ll certainly be writing about them sometime in the near future.
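
To give a flavor of the two mechanical ideas in that passage – weighted signals rolling up into a relevancy score, and a diversity check on the first page – here’s a deliberately simplified sketch. Every signal name, weight and category label below is my invention for illustration; the article confirms only that the signals, the classifiers and a diversity pass exist.

```python
# Simplified sketch of (1) combining weighted "signals" into a relevancy
# score and (2) capping how many results of one type reach the first
# page. All names, weights and categories are invented for illustration.

SIGNAL_WEIGHTS = {"pagerank": 0.4, "topicality": 0.4, "freshness": 0.2}

def relevancy(signals: dict) -> float:
    """Weighted sum of whatever signals we have for a page."""
    return sum(SIGNAL_WEIGHTS.get(name, 0.0) * value
               for name, value in signals.items())

def diversify(ranked_pages: list, max_per_category: int = 3) -> list:
    """Keep score order, but cap results of any one type (review,
    store, manufacturer, ...) so one perspective can't dominate."""
    seen = {}
    results = []
    for page in ranked_pages:
        category = page["category"]
        if seen.get(category, 0) < max_per_category:
            results.append(page)
            seen[category] = seen.get(category, 0) + 1
    return results[:10]

pages = [
    {"url": "review1.example", "category": "review",
     "signals": {"pagerank": 0.9, "topicality": 0.8, "freshness": 0.2}},
    {"url": "store.example", "category": "store",
     "signals": {"pagerank": 0.6, "topicality": 0.9, "freshness": 0.5}},
]
ranked = sorted(pages, key=lambda p: relevancy(p["signals"]), reverse=True)
print([p["url"] for p in diversify(ranked)])
```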

SMX starts tomorrow, and between Lisa Barone, Andy Beal & SERoundtable, I think there’s going to be a heap of coverage. I’ve asked the mozzers covering (Jane & Rebecca) to do their best to be as thorough and thoughtful as possible – they’ll try to present you with as much signal as possible, and most of the “advanced” topics, rather than disgorge everything from every session. Meanwhile, I’ll be on active duty, presenting, networking, listening and learning, and doing my best to bring back valuable information as well.

And yes, Rebecca’s making comics…
